prov/efa: support hardware counter #12114
@mrgolin Could you review this? Thanks!
size_t qp_table_sz_m1;
struct ofi_genlock qp_table_lock;
int urandom_fd;
uint32_t max_comp_cntr;
nit: a comment here explaining what this field is used for?
efa_device->device_caps = 0;
#endif
efa_device->max_comp_cntr = 0;
#if HAVE_IBV_DEVICE_ATTR_EX_MAX_COMP_CNTR
nit: move to a function? max_comp_counter_set
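The helper suggested in this nit could look roughly like the sketch below. It is only an illustration: `demo_device_attr` and `demo_efa_device` are hypothetical stand-ins for the real rdma-core and provider structs, the function name is adapted from the reviewer's suggestion, and the `max_comp_cntr` attribute field is assumed from the name of the `HAVE_IBV_DEVICE_ATTR_EX_MAX_COMP_CNTR` macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For this standalone sketch only; in a real build configure would define it. */
#ifndef HAVE_IBV_DEVICE_ATTR_EX_MAX_COMP_CNTR
#define HAVE_IBV_DEVICE_ATTR_EX_MAX_COMP_CNTR 1
#endif

/* Hypothetical stand-in for the extended device attributes. */
struct demo_device_attr {
	uint32_t max_comp_cntr;
};

/* Hypothetical stand-in for the provider's device struct. */
struct demo_efa_device {
	uint32_t max_comp_cntr;
};

/* Default to 0 (no hardware counters) and override only when the
 * attribute is available at build time and supplied by the caller. */
void demo_max_comp_cntr_set(struct demo_efa_device *dev,
			    const struct demo_device_attr *attr)
{
	assert(dev != NULL);
	dev->max_comp_cntr = 0;
#if HAVE_IBV_DEVICE_ATTR_EX_MAX_COMP_CNTR
	if (attr != NULL)
		dev->max_comp_cntr = attr->max_comp_cntr;
#else
	(void)attr;
#endif
}
```

Keeping the `#if` inside one small function would leave the device construction path free of preprocessor branches.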
cntr = container_of(cntr_fid, struct efa_cntr, util_cntr.cntr_fid);

/* Progress CQ to complete WQE in SQ and RQ */
I know this is here to avoid CQ overrun and to manage WQ resources, but I still think it is undesirable given the goal of a hardware counter read: a cheaper way to get completion counts without a heavyweight CQ poll. If we want to avoid CQ overrun, that can be a documented requirement for the application, or a separate change that protects against CQ overrun elsewhere. Meanwhile, the efa-direct fabric supports FI_PROGRESS_AUTO, which does not require the application to call fi_cntr_read to progress completions. So polling the CQ here seems awkward to me.
Since efa-direct claims FI_RM_DISABLED, resource management is the application's responsibility. NCCL GIN will read the counter value directly from hardware without calling fi_cntr_read, so it needs to poll the CQ separately to reclaim queue resources. I want to remove this internal cq polling from the hardware counter path to make this consistent.
efa-direct claims FI_RM_ENABLED today:
libfabric/prov/efa/src/efa_prov_info.c
Lines 103 to 104 in 0ede325
I thought we agreed that we do need to poll the CQ in the fi_cntr_read() path, even with HW counters?
I just want to call out that this doesn't make much sense, even if it is the safest approach. Also, as far as I can tell, we still bump the util counters in efa_cq_poll_ibv_cq, because the PR still binds the cntr to util_ep in fi_ep_bind. Then why don't we read from the util cntrs, which don't even have a hardware limit, except for FI_REMOTE_WRITE (where there are no completions on the target side of fi_write)?
This is not the correct thing to do in all cases. Please make sure the team is aligned and switch the implementation to the agreed-upon approach.
Add per-signal and per-counter EFA endpoints with hardware completion
counters that write directly to GPU memory, enabling the GPU kernel to
poll signal/counter values without host involvement.
Counter memory is allocated via the CUDA VMM API (cuMemCreate with
gpuDirectRDMACapable) and exported as a DMA-BUF fd. The NIC writes
the counter value directly to GPU HBM via cntr_open_ext with
FI_EFA_MEMORY_LOCATION_DMABUF.
Changes:
- m4: Add configure probe for fi_efa_comp_cntr_init_attr
- nccl_ofi_cuda: Add nccl_net_ofi_gpu_vmm_alloc/free using CUDA VMM
API for RDMA-capable GPU allocations that support DMA-BUF export
- dev header: Define nccl_ofi_gin_dev_counter_handle struct
- resources: Add gdaki_hw_counter (VMM alloc + DMA-BUF + cntr_open_ext),
gdaki_sc_endpoint (EP + two hw counters + QP/CQ + per-peer addressing);
refactor gdaki_fi_endpoint into open() + enable() for counter binding
- createContext: Create sc_endpoints when nSignals/nCounters > 0,
wire signal_handles/counter_handles into the device handle
- Bump fi_getinfo to FI_VERSION(2, 5) so libfabric populates
domain_attr->max_cntr_value from device capabilities
- Teardown: endpoint declared after counters so QP is destroyed
before counters (required since counters are attached to QP)
Requires libfabric with hardware counter support (PR ofiwg/libfabric#12114).
Tested on p6b.200.48xlarge: data PASS, signal counter = 1 (PASS),
GPU-read signal counter = 1 (PASS), clean teardown.
cntr_cnt in domain_attr is the optimal number of completion counters supported by the domain. According to the man page, it may be a fixed value of the maximum number of counters supported by the underlying hardware, or it may be a dynamic value based on the default attributes of the domain. Set it to the maximum number of counters supported by the EFA device, or leave it as 0 when hardware counters are not supported. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
For efa-direct, set max_cntr_value and max_err_cntr_value via fi_getinfo based on the comp_count_max_value and err_count_max_value from the EFA device and on user hints. The protocol path cannot use hardware counters because it generates multiple completion events per user operation. For API version < 2.5, default to UINT64_MAX. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
cuMemCreate with gpuDirectRDMACapable and exported as a DMA-BUF fd.
The NIC writes the counter value directly to GPU HBM via cntr_open_ext with
FI_EFA_MEMORY_LOCATION_DMABUF.
Changes:
- m4: Add configure probe for fi_efa_comp_cntr_init_attr; defines
HAVE_FI_EFA_COMP_CNTR when libfabric exposes the type. All hardware-
counter code paths added by this commit are guarded on this macro.
- nccl_ofi_cuda: Bind 9 cuMem* / cuDeviceGet driver functions and
add nccl_net_ofi_gpu_vmm_alloc/free using the CUDA VMM API. The
allocation requests gpuDirectRDMACapable so the buffer supports
DMA-BUF export, which is what cntr_open_ext requires.
- dev header: Define nccl_ofi_gin_dev_counter_handle (qp, cq,
cntr_value pointer, per-peer addressing). Add counter_handles,
signal_handles, nCounters, nSignals on nccl_ofi_gin_gdaki_dev_handle.
Layout is shared with NCCL's mirror struct in
nccl_device/gin/efa_gda/gin_efa_gda_dev.h.
- resources header / cpp: Add gdaki_hw_counter (RAII over the VMM
allocation + DMA-BUF fd + cntr_open_ext) and gdaki_sc_endpoint (EP
+ write_cntr + remote_write_cntr + QP/CQ + per-peer addressing +
two device handles, one for the WRITE counter and one for the
REMOTE_WRITE counter). Endpoint is declared after the counters so
C++ destructs the QP before the counters (binding requirement).
gdaki_fi_endpoint::open is split into open() (binds CQ + AV but
does not call fi_enable) plus enable(), so callers can bind
additional resources (such as counters) before enabling.
- createContext: When nSignals or nCounters is nonzero, allocate
max(nSignals, nCounters) sc_endpoints, allgather each one's
fi_addr, and populate per-peer addressing. Build GPU-resident
arrays of nccl_ofi_gin_dev_counter_handle pointers
(d_counter_handles, d_signal_handles), patch each handle's
cntr_value to the appropriate counter (FI_WRITE for counters,
FI_REMOTE_WRITE for signals), and wire the array pointers into
the device handle.
- createContext: Bump fi_getinfo to FI_VERSION(2, 5) on the GIN
proxy info path (nccl_ofi_gin_resources.cpp::get_gin_info) and
on the rdma init path (nccl_ofi_rdma.cpp::nccl_net_ofi_rdma_init)
so libfabric reports the hardware-counter capability.
- createContext: Call ctx->endpoint.enable() explicitly after open()
now that gdaki_fi_endpoint::open no longer enables.
Requires libfabric with hardware counter support
(ofiwg/libfabric#12114). When the libfabric headers do not expose
fi_efa_comp_cntr_init_attr, HAVE_FI_EFA_COMP_CNTR is undefined and the
sc_endpoint code path is compiled out; existing data-only Put behavior
is unchanged.
Implement hardware counter open/close and fi_ops_cntr operations (read, readerr, add, adderr, set, seterr, wait) that delegate to the corresponding ibv_*_comp_cntr functions from rdma-core. The application is responsible for calling fi_cq_read to prevent CQ overrun. Skip cntr add/adderr in the CQ polling path for hardware counters. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
… memory Add cntr_open_ext to fi_efa_ops_gda to create hardware completion counters with optional application-provided external memory for the completion and error counts, enabling zero-copy observation of completion progress by co-located processes or devices. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Attach hardware completion counter to QP with ibv_qp_attach_comp_cntr after QP is created in RESET state during ep enable. We cannot do this during ep bind because QP is not created yet. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add efa_hw_cntr_wait(), which polls the hardware completion counter until it reaches the requested threshold or the timeout expires. It uses exponential backoff starting at 1 microsecond and doubling each iteration for up to 5 attempts, or retries every 1 ms when the user asks for an infinite timeout. Also fix efa_cntr_wait, which did not handle an infinite timeout correctly. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
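A polling wait of this kind can be sketched as below. It is a simplified illustration rather than the code in this commit: the names (`demo_cntr_wait`, `demo_read_fn`, `demo_fake_cntr`) are hypothetical, the counter is read through a callback instead of rdma-core, and the backoff schedule is an assumption (double from 1 microsecond and cap at 1 ms), which may differ from the commit's exact five-attempt schedule. A negative timeout is assumed to mean "wait forever".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t (*demo_read_fn)(void *ctx);

static uint64_t demo_now_ms(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}

static void demo_sleep_us(uint64_t us)
{
	struct timespec ts = { (time_t)(us / 1000000u),
			       (long)(us % 1000000u) * 1000L };
	nanosleep(&ts, NULL);
}

/* Poll until the counter reaches threshold. Returns 0 on success or
 * -ETIMEDOUT once timeout_ms has elapsed; timeout_ms < 0 never expires. */
int demo_cntr_wait(demo_read_fn read_cntr, void *ctx, uint64_t threshold,
		   int timeout_ms)
{
	uint64_t start = demo_now_ms();
	uint64_t delay_us = 1;

	assert(read_cntr != NULL);
	for (;;) {
		if (read_cntr(ctx) >= threshold)
			return 0;
		if (timeout_ms >= 0 &&
		    demo_now_ms() - start >= (uint64_t)timeout_ms)
			return -ETIMEDOUT;
		demo_sleep_us(delay_us);
		/* Exponential backoff, capped at 1 ms (assumed schedule). */
		delay_us = delay_us * 2 > 1000 ? 1000 : delay_us * 2;
	}
}

/* Fake counter for demonstration: advances by step on every read. */
struct demo_fake_cntr {
	uint64_t value;
	uint64_t step;
};

uint64_t demo_fake_read(void *ctx)
{
	struct demo_fake_cntr *c = ctx;
	c->value += c->step;
	return c->value;
}
```

Starting with a very short sleep keeps latency low when the completion is imminent, while the cap bounds the polling cost during long waits.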
Add fi_efa_hw_cntr fabtest that exercises hardware counters through MSG pingpong operations. The test opens counters via cntr_open_ext from the GDA domain ops, binds them as txcntr/rxcntr, and uses the existing ft_get_cntr_comp path for completion tracking. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add RMA write support to fi_efa_hw_cntr via the -o write option. This adds rma_write() and run_rma() functions, and the API_OPTS parsing to select between MSG pingpong (default) and RMA write. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Add --external-mem flag to fi_efa_hw_cntr that enables external user-provided memory mode. When set, the test allocates buffers and passes them via FI_EFA_MEMORY_LOCATION_VA with the FI_EFA_COMP_CNTR_INIT_WITH_EXTERNAL_MEM flag to cntr_open_ext. Add corresponding pytest cases for pingpong and RMA write with external memory. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Hardware counter requires firmware support. Add environment variable FI_EFA_USE_HW_CNTR that is not registered via fi_param_define so we can control when to enable it without exposing the variable to applications. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
Guard all entry points to the hardware counter with FI_EFA_USE_HW_CNTR, which defaults to false until we enable it. Enable fabtests and unit tests with FI_EFA_USE_HW_CNTR=1. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
EFA device does not support FI_SELECTIVE_COMPLETION and efa-direct is a hardware offloading component, so we shouldn't implement FI_SELECTIVE_COMPLETION in software. Specifically when applications use hardware counter, we are unable to support FI_SELECTIVE_COMPLETION because the device requires the CQ to be polled to avoid CQ overrun and reclaim wr id. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
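The reason the CQ still has to be polled when a hardware counter is in use can be shown with a toy model. Everything here is hypothetical (`toy_queue_pair`, `toy_complete_one`, `toy_cq_drain`, and the depth of 4); it does not call libfabric or rdma-core and only illustrates that the counter keeps advancing while unpolled completion entries accumulate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CQ_DEPTH 4u

struct toy_queue_pair {
	uint64_t hw_counter;  /* completions counted by the "hardware" */
	uint32_t cq_pending;  /* completion entries not yet polled */
	bool cq_overrun;      /* set when an entry had no free CQ slot */
};

/* One operation completes: the counter always advances, but the
 * completion entry also needs a free slot in the CQ. */
void toy_complete_one(struct toy_queue_pair *qp)
{
	assert(qp != NULL);
	qp->hw_counter++;
	if (qp->cq_pending == TOY_CQ_DEPTH) {
		qp->cq_overrun = true;
		return;
	}
	qp->cq_pending++;
}

/* Stand-in for the application polling the CQ: frees up to max slots
 * and returns how many entries were drained. */
uint32_t toy_cq_drain(struct toy_queue_pair *qp, uint32_t max)
{
	uint32_t n;

	assert(qp != NULL);
	n = qp->cq_pending < max ? qp->cq_pending : max;
	qp->cq_pending -= n;
	return n;
}
```

In this model, reading only `hw_counter` never frees CQ slots, so an application that tracks progress through the counter still needs a periodic drain step, which is the role fi_cq_read plays in the commit messages above.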
Implement cntr_open_ext in fi_efa_ops_gda to create hardware completion
counters using ibv_create_comp_cntr from rdma-core.
Application can optionally provide its own memory for the completion and error
counts, enabling zero-copy observation of completion progress by
co-located processes or devices.
Implement fi_ops_cntr operations (read, readerr, add, adderr, set,
seterr) that delegate to the corresponding ibv_*_comp_cntr functions.